Lattice QCD on a Beowulf Cluster
Using commodity personal computers based on the Alpha processor, commodity network devices, and a switch,
we built an 8-node parallel computer. GNU/Linux was chosen as the operating system, and message
passing libraries such as PVM, LAM, and MPICH were tested as parallel programming environments. We
discuss our lattice QCD project for a heavy quark system on this computer.
Figure 1. Network configuration of the cluster.



Even a modest lattice QCD project demands a
large amount of computing resources. It has
therefore always been attractive to build a cheap
high performance computing platform out of
commodity PCs and commodity networking devices.
In the past, however, the availability of cheap
hardware components solved only part of the
problem of building a parallel computer. There
were large hidden costs in constructing a
do-it-yourself parallel computer, and only groups
that could dedicate a significant amount of
resources were able to take advantage of the
idea. The chief stumbling blocks were providing a
parallel programming environment (in both hardware
and software) from scratch and maintaining
one-of-a-kind hardware. Following the recent trend
in do-it-yourself clustering technology [1], we
built a cluster that uses only readily available
hardware and software components and is easy to
maintain. Here, we discuss our experience.

In terms of hardware, the node-level configuration
of our cluster does not differ from an ordinary
PC other than being monitor-less. Each node
consists of a single 600 MHz Alpha 21164
processor and SDRAM SIMM main memory. The amount
of memory on individual nodes varies from
128 Mbytes (5 nodes) to 256 Mbytes (2 nodes). The
SCSI hard disk on each node has either 2 Gbytes
(4 nodes) or 4 Gbytes (4 nodes). Additionally,
each node has a CD-ROM drive and a 3 1/2 inch
floppy drive. The power requirement of each node
is 300 Watt. As a network component, each node
has a 100 Mbps Ethernet card (3Com 3C905). The
node that serves as the front-end has one more
100 Mbps Ethernet card for the outside
connection. For inter-processor communication, we
use a 100 Mbps switched hub (24-port Intel 510T).
Unlike the bus structure of an ordinary hub, this
inexpensive device allows simultaneous
communication among the nodes and offers
flexibility in the communication topology.

Fig. 1 shows the network configuration of our
cluster. Since we use a switch, the communication
distance between any two nodes is the same
unless the number of nodes becomes larger than






Figure 2. Network bandwidth (ping test). The
vertical axis is Mbytes/sec and the horizontal axis
is ping data size in bytes.

Figure 3. Network bandwidth (round robin test).
The vertical axis is Mbytes/sec and the horizontal
axis is message size in bytes.



the number of available ports on the switch. To
the outside world, only the front-end node exists.
All the nodes are assigned local subnet addresses
(192.168.1.1 - 192.168.1.8), where 192.168.x.x are
addresses reserved specifically for private
subnets, and the front-end node acts as a gateway
to the rest. In this way, we can increase the
number of nodes in the cluster without worrying
about available IP addresses.

As the node operating system, we use Alzza Linux
version 5.2a [2] for the Alpha processor, a
Korean customized version of Red Hat Linux 5.2
[3], with kernel version 2.2.1. Three different
parallel programming environments, LAM (Local
Area Multicomputer) version 6.1 [4], MPICH
(Message Passing Interface-Chameleon) version
1.2.2 [5], and PVM (Parallel Virtual Machine)
version 3.4 [6], have been tested on our platform.
These are all based on the message passing
paradigm of parallel computing and use TCP/IP for
the actual communication. Linux comes with
FORTRAN and C compilers, and the parallel
programming environments offer wrappers for these
languages. Since these environments use the
remote shell (rsh) for parallel job execution,
users need to have accounts on each node. NIS is
used for password validation. The hard disk space
on each node is divided into three partitions:
one for the local operating system, another for
the NFS-mounted '/home' directory, and a third
for scratch space for large I/O operations.

The installation procedure consists of two parts:
one for the Linux operating system setup and the
other for the parallel programming library setup.
Once the Internet setup for each node is properly
done, the subnet can be established by just
connecting the Ethernet ports. The overall cost
of building our 8-node configuration is shown in
Table 1. The cost of console devices such as a
monitor, mouse, and keyboard is not included
since we reused existing ones (this table should
be taken as a rough indication of the cost of our
cluster since component prices change quite
rapidly).
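
The private addressing scheme described above can be checked with a short
script. This is only a sketch: the /24 prefix length and the names
CLUSTER_SUBNET, NODE_ADDRESSES, check_addressing, and spare_host_addresses
are assumptions introduced for illustration and do not come from our setup
files.

```python
import ipaddress

# Node addresses quoted in the text: 192.168.1.1 through 192.168.1.8.
# The /24 prefix length is an assumption made for this sketch.
CLUSTER_SUBNET = ipaddress.ip_network("192.168.1.0/24")
NODE_ADDRESSES = [ipaddress.ip_address(f"192.168.1.{i}") for i in range(1, 9)]


def check_addressing(subnet, nodes):
    """Return True if every node address is private and lies inside subnet."""
    return all(addr.is_private and addr in subnet for addr in nodes)


def spare_host_addresses(subnet, nodes):
    """Count usable host addresses in subnet not yet assigned to a node."""
    usable = subnet.num_addresses - 2  # exclude network and broadcast addresses
    return usable - len(nodes)


if __name__ == "__main__":
    print(check_addressing(CLUSTER_SUBNET, NODE_ADDRESSES))      # True
    print(spare_host_addresses(CLUSTER_SUBNET, NODE_ADDRESSES))  # 246
```

Under the assumed /24 prefix, the subnet leaves room for many more nodes
than the 24 ports of the switch, so the number of ports, not the address
space, is the practical limit on growth.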

Since the performance of a cluster is determined
by (single-node performance - system overhead due
to inter-node communication) x the number of
nodes, the sustained speed of a single CPU and the
efficiency of the network component play an important
Table 1
Cost for our 8 node cluster

component                 no. of components   price/component   net price
PC (memory, HDD, etc.)    8                   ~ 3000 $          ~ 24,000 $
additional LAN card       1                   70 $              70 $
LAN cable                 9                   -                 35 $
switch                    1                   2100 $            2100 $
UPS (3 kW)                1                   1560 $            1560 $
total                                                           27,660 $



role in a cluster. With the GNU/Linux compiler,
various tests showed that the sustained speed of
a single Alpha processor is better than that of
an Intel processor only by the difference in CPU
clock speed. This is because the Alpha 21164
processor does not support out-of-order execution
(under the same conditions, the Alpha 21264, which
supports out-of-order execution, does better than
the Alpha 21164 by about a factor of two). The
serial version of our quenched code for an
$8^3 \times 32$ lattice, which is coded with the
SU(3) index as the innermost loop and uses
multi-hit and over-relaxation algorithms, achieved
50 MFLOPS. With the Compaq FORTRAN compiler for
Linux (beta version), the same code achieved
91 MFLOPS (a code with a long innermost loop may
do better with the Compaq FORTRAN compiler by a
factor of 4 or more [7]). In contrast, the same
code on a 200 MHz Intel Pentium II MMX achieved
18 MFLOPS with the GNU/Linux compiler. This
single-node benchmark suggests that, on the same
hardware, we can take advantage of future compiler
development without further tuning of the code as
the GNU/Linux compiler improves.

As for the network performance, we tested the
network setup using two different methods: a
"ping" test and a "round-robin" communication
test. The "ping" test uses the ICMP layer on top
of the IP layer, and the "round-robin" test uses
the TCP layer on top of the IP layer. Fig. 2
shows the "ping" test bandwidth and Fig. 3 shows
the "round-robin" test bandwidth of the LAM
parallel programming environment. We found that
LAM does better than MPICH for short messages and
MPICH does better than LAM for large messages.
Although we have a dedicated network for our
cluster, the three parallel programming
environments we have tested all assume a normal
LAN environment and use the TCP/IP layer above the
link layer in order to avoid the various problems
that arise from sharing a communication network.
Further improvement in communication speed could
be achieved if a UDP layer with error handling
were used instead of TCP.

Under GNU/Linux, the MPI parallel version of our
quenched code for a $16^3 \times 32$ lattice
achieved 346 MFLOPS with LAM and 378 MFLOPS with
MPICH. Thus, the communication overhead is about
21 % for LAM and 14 % for MPICH. The parallel code
has not yet been tested with the Compaq FORTRAN
compiler.
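
The arithmetic behind these benchmark figures can be sketched as follows.
The function names are hypothetical, and the per-node baseline of
55 MFLOPS is an assumption: the text does not state which single-node
speed the overhead percentages are measured against, and 55 MFLOPS is
chosen here only because it reproduces the quoted values (the 50 MFLOPS
serial figure for the smaller lattice would give smaller overheads).

```python
def cluster_performance(single_node_mflops, overhead_mflops, n_nodes):
    """(single-node performance - communication overhead) x number of nodes."""
    return (single_node_mflops - overhead_mflops) * n_nodes


def overhead_fraction(measured_mflops, single_node_mflops, n_nodes):
    """Fraction of the ideal (overhead-free) speed lost in the parallel run."""
    ideal = single_node_mflops * n_nodes
    return 1.0 - measured_mflops / ideal


def clock_normalized(mflops, clock_mhz):
    """Sustained MFLOPS per MHz of CPU clock."""
    return mflops / clock_mhz


if __name__ == "__main__":
    N_NODES = 8
    BASELINE = 55.0  # MFLOPS per node; assumed, not stated in the text

    for name, measured in (("LAM", 346.0), ("MPICH", 378.0)):
        frac = overhead_fraction(measured, BASELINE, N_NODES)
        print(f"{name}: {frac:.1%} overhead")  # LAM: 21.4%, MPICH: 14.1%

    # Single-node figures with the GNU/Linux compiler, from the text.
    print(f"Alpha 21164: {clock_normalized(50.0, 600.0):.3f} MFLOPS/MHz")
    print(f"Pentium II:  {clock_normalized(18.0, 200.0):.3f} MFLOPS/MHz")
```

The last two lines give 0.083 and 0.090 MFLOPS per MHz, which is
consistent with the statement that, with the GNU/Linux compiler, the
Alpha 21164 outperforms the Intel processor essentially by the ratio of
clock speeds.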

Currently, we are generating full QCD
configurations on an $8^3 \times 32$ lattice at
$\beta = 5.4$ with $m_q a = 0.01$ for heavy quark
physics. We have found that a relatively cheap
high performance computing platform can be easily
constructed and maintained using only commodity
software and hardware.



REFERENCES

1. Beowulf project,
   http://cesdis.gsfc.nasa.gov/linux/beowulf/beowulf.html
2. LinuxKorea Corp., http://www.korealinux.co.kr/
3. Red Hat Software Inc., http://www.redhat.com/
4. LAM/MPI Parallel Computing,
   http://www.mpi.nd.edu/lam
5. MPICH - A Portable MPI implementation,
   http://www.mcs.anl.gov/mpi/mpich/index.html
6. PVM: Parallel Virtual Machine,
   http://www.epm.ornl.gov/pvm
7. D.S. Ryu, private communication.



